By Nanmanat Disayakamonpan
Data Analytics, Constructor University
2.1 Understanding Data
2.2 Importing Libraries
2.3 Defining Pipeline
3.1 Feature Overview
3.2 Demographic Aspects
3.3 School Related Aspects
3.4 Social Aspects
3.5 Other Factors
3.6 Correlation Matrix
4.1 Checking Missing Values
4.2 Dropping Non-necessary Columns
4.3 Changing Variable Type (Label Encoding)
4.4 Feature Selection Using SelectKBest
4.5 Splitting the Data into Train and Test Sets
5.1 Task 1: Regression Model
(1) Linear Regression
(2) Ridge Regression
(3) Decision Tree Regression
(4) Random Forest Regression
(5) Support Vector Regression
(6) Gradient Boosting Regression
5.2 Task 2: Classification Model
(1) Logistic Regression
(2) Decision Tree Classification
(3) Random Forest Classification
(4) Gradient Boosting Classification
(5) Support Vector Classification
(6) KNeighbors Classification
6.1 Task 1: Regression Model Performance Comparison
6.2 Task 2: Classification Model Performance Comparison
6.3 Conclusion
The purpose of this project is to delve into the intricate factors influencing student success in secondary education, specifically focusing on the subjects of Mathematics and Portuguese. Leveraging a comprehensive dataset comprising student grades, demographic attributes, social factors, and school-related features collected through school reports and questionnaires, this analysis aims to uncover the underlying determinants of academic achievement.
The overarching objective of this endeavor is to gain a deeper understanding of the elements that shape performance outcomes in Mathematics and Portuguese. Two distinct tasks have been delineated to achieve this aim:
Task 1: Build a predictive model for the target variable G1.Port, excluding all other grade features; include variables such as activities, famrel, and failures.Math, but omit Mjob and Fjob.
Task 2: Bin the target variable G1.Port into four equally populated bins and utilize this newly derived categorical variable as the response for a classification model. Similar to Task 1, the model should incorporate variables such as activities, famrel, and failures.Math, while omitting Mjob and Fjob.
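The binning required for Task 2 can be done with pandas' `qcut`, which splits on quantiles so each bin holds roughly the same number of students. This is a minimal sketch: the grade values below are made up, but the column name G1.Port matches the dataset.

```python
import pandas as pd

# Hypothetical grade values; in the project these come from the G1.Port column.
grades = pd.Series([8, 10, 11, 12, 13, 14, 15, 16], name="G1.Port")

# pd.qcut splits on quantiles, so the four bins are equally populated.
g1_binned = pd.qcut(grades, q=4, labels=["low", "mid-low", "mid-high", "high"])
print(g1_binned.value_counts().sort_index())  # 2 students per bin
```

The resulting categorical column then serves as the response variable for the classifiers in Task 2.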
Abstract
This report embarks on a methodical exploration of the data, commencing with the construction of a generic pipeline to delineate the project's outline. Subsequently, a series of data exploratory analyses were conducted, encompassing histograms to discern overall feature distributions and boxplots to identify outliers. Further investigation was then undertaken to examine the interplay between various variables and their potential impact on success in Mathematics and Portuguese, categorized into demographic, school-related, social, and miscellaneous factors. Additionally, a correlation analysis was conducted using a heatmap to elucidate any underlying relationships between variables.
Preceding model development, meticulous data preprocessing was undertaken, involving the identification and handling of missing values, elimination of unnecessary columns, conversion of variable types via label encoding, feature selection using SelectKBest to optimize model performance, and splitting the data into train and test sets.
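As a sketch of the SelectKBest step (shown here on synthetic data, since the real encoded feature matrix is built later in the report), features are scored against the target with an F-test and only the top k are kept before the train/test split:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded feature matrix: only column 0 drives y.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)

# Keep the 3 features with the highest F-statistics against the target.
selector = SelectKBest(score_func=f_regression, k=3)
X_selected = selector.fit_transform(X, y)

# The usual 80/20 split follows feature selection in this outline.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```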
The modeling phase entailed the implementation of both regression and classification models to address the defined tasks. For the regression task, a suite of models was employed, including (1) Linear Regression, (2) Ridge Regression, (3) Decision Tree Regression, (4) Random Forest Regression, (5) Support Vector Regression, and (6) Gradient Boosting Regression.
In the classification task, a diverse array of classifiers, namely (1) Logistic Regression, (2) Decision Tree Classification, (3) Random Forest Classification, (4) Gradient Boosting Classification, (5) Support Vector Classification, and (6) KNeighbors Classification, were leveraged. Subsequently, confusion matrix, ROC, and AUC curves were generated to facilitate the evaluation of model performance.
In conclusion, a comprehensive comparison report was compiled, summarizing the efficacy of each model and identifying the most promising approaches in terms of accuracy and performance.
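The classification evaluation tools named above (confusion matrix and ROC/AUC) can be illustrated on a toy example; the labels and scores below are synthetic, not the project's results:

```python
from sklearn.metrics import confusion_matrix, roc_curve, auc

# Toy binary labels, hard predictions, and probability scores.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]
y_score = [0.1, 0.6, 0.2, 0.9, 0.8, 0.4]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[2 1]
           #  [1 2]]

# ROC curve and the area under it, computed from the continuous scores.
fpr, tpr, _ = roc_curve(y_true, y_score)
print(round(auc(fpr, tpr), 3))  # 0.889
```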
The data in this file concern student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and were collected using school reports and questionnaires.
Below are the field names along with their descriptions.
| Variable Name | Description | Values |
|---|---|---|
| school | student’s school | “GP” - Gabriel Pereira or “MS” - Mousinho da Silveira |
| sex | student's sex | "F" - Female or "M" - Male |
| age | student's age | numeric: from 15 to 22 |
| address | student’s home address type | “U” - urban or “R” - rural |
| famsize | family size | “LE3” - less or equal to 3 or “GT3” - greater than 3 |
| Pstatus | parent’s cohabitation status | “T” - living together or “A” - apart |
| Medu | mother’s education | numeric: 0 - none, 1 - primary education, 2 – 5th to 9th grade, 3 – secondary education, 4 – higher education |
| Fedu | father's education | numeric: 0 - none, 1 - primary education, 2 – 5th to 9th grade, 3 – secondary education, 4 – higher education |
| Mjob | mother’s job | “teacher”, “health” care related, “services”, “at_home” or “other” |
| Fjob | father’s job | “teacher”, “health” care related, “services”, “at_home” or “other” |
| reason | reason to choose this school | “home”, “reputation”, “course” preference or “other” |
| guardian | student’s guardian | “mother”, “father” or “other” |
| traveltime | home to school travel time | numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, 4 - >1 hour |
| studytime | weekly study time | numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, 4 - >10 hours |
| failures | number of past class failures | numeric: n if 1<=n<3, else 4 |
| schoolsup | extra educational support | Yes or No |
| famsup | family educational support | Yes or No |
| paid | extra paid classes within the course subject | Yes or No |
| activities | extra-curricular activities | Yes or No |
| nursery | attended nursery school | Yes or No |
| higher | wants to take higher education | Yes or No |
| internet | Internet access at home | Yes or No |
| romantic | with a romantic relationship | Yes or No |
| famrel | quality of family relationships | numeric: from 1 - very bad to 5 - excellent |
| freetime | free time after school | numeric: from 1 - very low to 5 - very high |
| goout | going out with friends | numeric: from 1 - very low to 5 - very high |
| Dalc | workday alcohol consumption | numeric: from 1 - very low to 5 - very high |
| Walc | weekend alcohol consumption | numeric: from 1 - very low to 5 - very high |
| health | current health status | numeric: from 1 - very bad to 5 - very good |
| absences | number of school absences | numeric: from 0 to 93 |
| G1 | first period grade | numeric: from 0 to 20 |
| G2 | second period grade | numeric: from 0 to 20 |
| G3 | final grade | numeric: from 0 to 20 |
It is important to note that the features failures, paid, absences, G1, G2, G3 are recorded for the Math subject and the Portuguese subject, hence a corresponding suffix has been added to the variable name.
It is also crucial to acknowledge that the dataset comprises 803 students from the Gabriel Pereira (GP) school and 114 students from the Mousinho da Silveira (MS) school. This discrepancy in sample sizes calls for cautious interpretation: trends observed in the pooled data will be dominated by the much larger GP sample. Awareness of this imbalance is essential to avoid bias when drawing conclusions about the comparative academic performance of the two schools.
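A quick way to quantify this imbalance is `value_counts`. The sketch below uses a stand-in frame built from the counts stated above; in the notebook, `df` is the loaded dataset.

```python
import pandas as pd

# Stand-in with the school counts reported above (803 GP, 114 MS).
df = pd.DataFrame({"school": ["GP"] * 803 + ["MS"] * 114})

counts = df["school"].value_counts()
shares = (counts / len(df)).round(3)
print(counts.to_dict())  # {'GP': 803, 'MS': 114}
print(shares.to_dict())  # {'GP': 0.876, 'MS': 0.124}
```

With GP accounting for roughly 88% of students, any pooled statistic largely reflects the GP population.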
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder, label_binarize
from sklearn.feature_selection import SelectKBest, f_regression, f_classif, RFE
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVR, SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, AdaBoostClassifier, GradientBoostingRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, mean_squared_error, mean_absolute_error, r2_score, roc_curve, auc, recall_score, confusion_matrix, precision_score, f1_score, classification_report
from catboost import CatBoostClassifier
from sklearn import metrics
from itertools import cycle
In this project, I established a generic pipeline designed to handle various stages of data analysis and model evaluation: (1) Reading the Data, (2) Data Cleaning Check, (3) Splitting the Data, (4) Simple Data Preprocessing, (5) Data Exploration, (6) Training a Standard Model, and (7) Evaluating the Model.
In this project, I executed only the pipeline steps relevant to my specific needs and objectives at each stage of the analysis; this selective approach enables a targeted and efficient investigation of the data and improves the depth and quality of subsequent analyses.
# Step 1: Read the data
def read_data(file_path):
    df = pd.read_csv(file_path)
    dataset_size = df.shape
    dataset_info = df.info()
    return df, dataset_size, dataset_info
# Step 2: Simple data cleaning check
def data_cleaning_check(df):
    # Check for odd variable types and missing values
    print("Data Types:")
    print(df.dtypes)
    print("\nMissing Values:")
    print(df.isnull().sum())
    return df
# Step 3: Split the data into training and test sets
def split_data(df):
    X = df.drop(columns='G1.Port')
    y = df['G1.Port']
    return train_test_split(X, y, test_size=0.2, random_state=42)
# Step 4: Simple data preprocessing
def simple_data_preprocessing(X_train, X_test, y_train):
    # Remove cases with missing values
    X_train = X_train.dropna()
    y_train = y_train.loc[X_train.index]  # Keep only corresponding labels
    X_test = X_test.dropna()
    return X_train, X_test, y_train
# Step 5.1: Histogram data exploration
def histogram_data_exploration(df):
    # Select only numerical columns
    numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
    # Calculate the number of rows for a 2-column subplot grid
    num_columns = len(numeric_columns)
    num_rows = (num_columns + 1) // 2  # Round up so every column gets a subplot
    # Set up the matplotlib figure and axes
    fig, axs = plt.subplots(nrows=num_rows, ncols=2, figsize=(30, 6 * num_rows))
    # Flatten the axes array for easy iteration
    axs = axs.flatten()
    # Loop through each numerical column and plot a histogram
    for i, column in enumerate(numeric_columns):
        sns.histplot(df[column], ax=axs[i], bins=15, kde=True, color='skyblue', edgecolor='black')
        axs[i].set_title(f'Histogram of {column}')
        axs[i].set_xlabel('Value')
        axs[i].set_ylabel('Frequency')
    # Adjust layout to prevent overlapping
    plt.tight_layout()
    plt.show()
    return df
# Step 5.2: Boxplot data exploration
def boxplot_data_exploration(df):
    # Select numerical columns, excluding the 'Unnamed' index column
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    numeric_columns = numeric_columns[~numeric_columns.str.contains('Unnamed')]
    # Calculate the subplot grid: up to 3 plots per row
    num_columns = len(numeric_columns)
    num_rows = (num_columns + 2) // 3  # Round up
    num_cols = min(num_columns, 3)
    # Set up the matplotlib figure and axes
    fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15 * num_cols, 5 * num_rows))
    # Flatten the axes array for easy iteration
    axs = axs.flatten()
    # Loop through each numerical column and plot a boxplot
    for i, column in enumerate(numeric_columns):
        sns.boxplot(x=df[column], ax=axs[i], width=0.3, palette="Set3")
        axs[i].set_title(f'Boxplot of {column}')
        axs[i].set_xlabel('')
    # Remove empty subplots
    for i in range(num_columns, num_rows * num_cols):
        fig.delaxes(axs[i])
    # Adjust layout
    plt.tight_layout()
    plt.show()
    return df
# Step 6: Train a standard model
def train_model(X_train, y_train):
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)
    return model
# Step 7: Evaluate the model
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))
# Define the pipeline. Note: Pipeline is used here only as a named registry of
# step functions (retrieved via named_steps below); it is never fitted directly.
data_pipeline = Pipeline([
    ('read_data', read_data),
    ('data_cleaning_check', data_cleaning_check),
    ('split_data', split_data),
    ('simple_data_preprocessing', simple_data_preprocessing),
    # ('histogram_data_exploration', histogram_data_exploration),
    # ('boxplot_data_exploration', boxplot_data_exploration),
    ('train_model', train_model),
    ('evaluate_model', evaluate_model)
])
df, dataset_size, dataset_info = data_pipeline.named_steps['read_data']('final project_data.csv')
data_cleaned = data_pipeline.named_steps['data_cleaning_check'](df)
X_train, X_test, y_train, y_test = data_pipeline.named_steps['split_data'](df)
X_train_processed, X_test_processed, y_train_processed = data_pipeline.named_steps['simple_data_preprocessing'](X_train, X_test, y_train)
# data_exploration1 = data_pipeline.named_steps['histogram_data_exploration'](df)
# data_exploration2 = data_pipeline.named_steps['boxplot_data_exploration'](df)
#model = data_pipeline.named_steps['train_model'](X_train_processed, y_train_processed)
#data_pipeline.named_steps['evaluate_model'](model, X_test_processed, y_test)
Output (abridged): the DataFrame has a RangeIndex of 917 entries (0 to 916) and 40 columns — an 'Unnamed: 0' index column (float64) plus the attributes listed above, with Math/Port suffixed variants of failures, paid, absences, G1, G2, and G3. Dtypes: float64(1), int64(21), object(18); memory usage ≈ 286.7 KB. Every column has 917 non-null values, and the missing-value check reports 0 missing for all 40 columns.
This DataFrame contains information on student attributes and academic performance across multiple subjects.
Size and Structure: The dataset contains 917 entries and 40 columns, with each row representing a student and columns representing attributes including demographics, family background, study habits, and academic outcomes.
Data Types: The dataset includes numeric data types (integers and floats) for variables like age and grades, and categorical variables (objects/strings) for attributes such as school and sex.
Missing Values: There are no missing values in the dataset, with all 917 entries having non-null values in each column.
This dataset offers a comprehensive view for analyzing factors influencing student performance and behavior.
Before proceeding to model building, the Exploratory Data Analysis (EDA) section is a crucial step in the data analysis process where we aim to understand the dataset and uncover insights through various perspectives.
numeric_columns = df.select_dtypes(include=['int64', 'float64']).columns
# Calculate the number of rows for a 2-column subplot grid
num_columns = len(numeric_columns)
num_rows = (num_columns + 1) // 2  # Round up so every column gets a subplot
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(nrows=num_rows, ncols=2, figsize=(30, 6 * num_rows))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Loop through each numerical column and plot a histogram
for i, column in enumerate(numeric_columns):
    sns.histplot(df[column], ax=axs[i], bins=15, kde=True, color='skyblue', edgecolor='black')
    axs[i].set_title(f'Histogram of {column}')
    axs[i].set_xlabel('Value')
    axs[i].set_ylabel('Frequency')
# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()
The histogram visualization provides insights into various aspects:
Age Distribution: Students' ages predominantly cluster around 17 years.
Parental Education: Mothers tend to have higher education levels than fathers.
Commute Time and Study Time: Most students have short commute times and allocate 2 to 5 hours per week to studying.
Family Relationships and Free Time: Students generally report positive family relationships and ample free time.
Social Activities: Students frequently go out with friends.
Alcohol Consumption: Minimal alcohol consumption is reported on both weekdays and weekends.
Health Status: The majority of students report good health.
Academic Performance: Low failure rates and consistent attendance are observed.
Grades Distribution: Grades in both subjects span the 0 to 20 scale, concentrated around the midrange.
# Select numerical columns, excluding the 'Unnamed' index column
numeric_columns = df.select_dtypes(include=[np.number]).columns
numeric_columns = numeric_columns[~numeric_columns.str.contains('Unnamed')]
# Calculate the subplot grid: up to 3 plots per row
num_columns = len(numeric_columns)
num_rows = (num_columns + 2) // 3  # Round up
num_cols = min(num_columns, 3)
# Set up the matplotlib figure and axes
fig, axs = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(15 * num_cols, 5 * num_rows))
# Flatten the axes array for easy iteration
axs = axs.flatten()
# Loop through each numerical column and plot a boxplot
for i, column in enumerate(numeric_columns):
    sns.boxplot(x=df[column], ax=axs[i], width=0.3, palette="Set3")
    axs[i].set_title(f'Boxplot of {column}')
    axs[i].set_xlabel('')
# Remove empty subplots
for i in range(num_columns, num_rows * num_cols):
    fig.delaxes(axs[i])
# Adjust layout
plt.tight_layout()
plt.show()
The box plot reveals outliers in various variables, notably in "absences.Math," "absences.Port," "failures.Math," and "failures.Port." These outliers may stem from extreme circumstances like prolonged illness or personal emergencies, as well as recurring behavioral patterns such as habitual absenteeism or academic challenges.
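The outliers the boxplots flag correspond to the standard 1.5×IQR whisker rule. A minimal sketch of that rule, on made-up absence counts (the real values come from columns like absences.Math):

```python
import pandas as pd

# Made-up absence counts; one extreme value mimics the boxplot outliers.
absences = pd.Series([0, 2, 4, 4, 6, 8, 10, 12, 40])

q1, q3 = absences.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points beyond the whiskers are the ones drawn as individual dots.
outliers = absences[(absences < lower) | (absences > upper)]
print(outliers.tolist())  # [40]
```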
# Melt the DataFrame to create a long-form representation suitable for visualization
df_melted1 = df.melt(id_vars='sex', value_vars=['G1.Math', 'G2.Math', 'G3.Math', 'G1.Port', 'G2.Port', 'G3.Port'], var_name='Subject', value_name='Grade')
grouping = df_melted1.groupby(by=["sex", "Subject"])["Grade"].mean().reset_index()
# Map original values to descriptive names
status_mapping = {'F': 'Female', 'M': 'Male'}
grouping['sex'] = grouping['sex'].map(status_mapping)
# Create a grouped bar plot with Plotly
fig = px.bar(grouping, x='Subject', y='Grade', color='sex', barmode='group',
             title='The Average Grade as a Performance Comparison: between males and females',
             labels={'Grade': 'Average Grade'},
             color_discrete_map={'Female': 'pink', 'Male': 'skyblue'})
fig.update_layout(xaxis_title='Subject', yaxis_title='Grade', showlegend=True)
fig.update_traces(marker_line_color='white', marker_line_width=1.5)
fig.show()
The bar plot displays average grades by gender and subject. Female students generally outperformed males in Portuguese, while males achieved higher grades in Mathematics. This gender-based disparity in academic performance suggests areas for further investigation.
# Melt the DataFrame to create a long-form representation suitable for visualization
df_melted2 = df.melt(id_vars='Pstatus', value_vars=['G1.Math', 'G2.Math', 'G3.Math', 'G1.Port', 'G2.Port', 'G3.Port'], var_name='Subject', value_name='Grade')
grouping = df_melted2.groupby(by=["Pstatus", "Subject"])["Grade"].mean().reset_index()
# Map original values to descriptive names
status_mapping = {'T': 'Together', 'A': 'Apart'}
grouping['Pstatus'] = grouping['Pstatus'].map(status_mapping)
# Create a grouped bar plot with Plotly
fig = px.bar(grouping, x='Subject', y='Grade', color='Pstatus', barmode='group',
             title='The Average Grade as a Performance Comparison: by Parent\'s Cohabitation Status',
             labels={'Grade': 'Average Grade'}, color_discrete_map={'Together': 'blue', 'Apart': 'red'})
# Update layout and legend
fig.update_layout(xaxis_title='Subject', yaxis_title='Grade', showlegend=True)
fig.update_traces(marker_line_color='white', marker_line_width=1.5)
fig.show()
The visualization indicates that students whose parents are living apart (represented by the red bars) tend to have better academic performance compared to those whose parents are living together (represented by the blue bars) across both subjects.
# Grouping the data by school and reason and counting the number of students for each combination
reason_counts = df.groupby(['school', 'reason']).size().reset_index(name='count')
# Map school names to their full names
school_names = {'GP': 'Gabriel Pereira', 'MS': 'Mousinho da Silveira'}
reason_counts['school'] = reason_counts['school'].map(school_names)
# Define custom colors for each reason
custom_colors = {'course': 'purple', 'home': 'blue', 'reputation': 'pink', 'other': 'skyblue'}
# Create a bar chart with Plotly
fig = px.bar(reason_counts, x='school', y='count', color='reason',
             title='Reasons for Choosing School',
             labels={'school': "Student's School", 'count': 'Number of Students', 'reason': 'Reason'},
             barmode='group',
             color_discrete_map=custom_colors)  # Specify custom colors
# Update x-axis tick labels
fig.update_xaxes(tickvals=['Gabriel Pereira', 'Mousinho da Silveira'])
fig.update_traces(showlegend=True)  # legendgroup expects a string, so it is omitted here
fig.update_layout(xaxis_title="Student's School", yaxis_title='Number of Students', showlegend=True)
fig.update_traces(marker_line_color='white', marker_line_width=1.5)
fig.show()
The bar chart illustrates reasons for students choosing between Gabriel Pereira and Mousinho da Silveira schools:
Course Preference: Most students prioritize the academic programs or courses offered by the schools (purple bars).
Home Proximity: A significant number of students choose based on the school's proximity to their homes (blue bars).
Reputation: Some students consider the school's reputation in their decision-making (pink bars).
Other Reasons: A smaller proportion of students cite various other factors influencing their choice (sky blue bars).
In summary, course preference emerges as the primary factor influencing students' decisions.
# Melt the DataFrame to have a tidy format suitable for plotting
df_melted4 = df.melt(id_vars='school', value_vars=['G1.Math', 'G2.Math', 'G3.Math', 'G1.Port', 'G2.Port', 'G3.Port'], var_name='Subject', value_name='Grade')
grouping = df_melted4.groupby(by=["school", "Subject"])["Grade"].mean().reset_index()
# Map original values to descriptive names
status_mapping = {'GP': 'Gabriel Pereira', 'MS': 'Mousinho da Silveira'}
grouping['school'] = grouping['school'].map(status_mapping)
# Create a grouped bar plot with Plotly
fig = px.bar(grouping, x='Subject', y='Grade', color='school', barmode='group',
             title='The Average Grade as a Performance Comparison: by schools',
             labels={'Grade': 'Average Grade'},
             color_discrete_map={'Gabriel Pereira': 'blue', 'Mousinho da Silveira': 'green'})
fig.update_layout(xaxis_title='Subject', yaxis_title='Grade', showlegend=True)
fig.update_traces(marker_line_color='white', marker_line_width=1.5)
fig.show()
The grouped bar plot compares average grades between students from Gabriel Pereira (blue bars) and Mousinho da Silveira (green bars) across subjects. Gabriel Pereira students generally achieve higher grades in both Mathematics and Portuguese compared to Mousinho da Silveira students, suggesting potential differences in teaching methods or student demographics
# Melt the DataFrame to have a tidy format suitable for plotting
df_melted5 = df.melt(id_vars=['paid.Math', 'paid.Port'],
                     value_vars=['G1.Port', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math'],
                     var_name='Subject', value_name='Grade')
# Plotting
plt.figure(figsize=(15, 5))
# Plotting the first facet for paid.Math
plt.subplot(1, 2, 1)
sns.barplot(data=df_melted5, x='Subject', y='Grade', hue='paid.Math', palette='Set1', ci=None)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Extra Paid Classes Within Mathematics (Paid.Math)')
# Plotting the second facet for paid.Port
plt.subplot(1, 2, 2)
sns.barplot(data=df_melted5, x='Subject', y='Grade', hue='paid.Port', palette='Set1', ci=None)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Extra Paid Classes Within Portuguese (Paid.Port)')
# Adjust layout
plt.tight_layout()
plt.show()
The analysis of academic performance based on enrollment in extra paid classes reveals intriguing findings. Students opting for the additional Mathematics class achieved higher average grades in both Mathematics and Portuguese subjects. Conversely, those enrolled in the extra-paid Portuguese class tended to have lower average grades in both subjects, except for the final Mathematics grade. This implies a potential trade-off between investing in additional support for one subject and performance in the other.
# Define the facets and their corresponding columns
facetstime = {
    'studytime': 'studytime',
    'traveltime': 'traveltime'
}
# Split the facets into groups of two
facet_groups = [list(facetstime.keys())[i:i+2] for i in range(0, len(facetstime), 2)]
grade_cols = ['G1.Port', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math']
# Iterate over each facet group
for facet_group in facet_groups:
    # Create subplots for the facet group
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
    # Iterate over each facet in the group
    for i, facet in enumerate(facet_group):
        column = facetstime[facet]
        # Group the data by the facet column and calculate the average grades for each subject
        average_grades = df.groupby(column)[grade_cols].mean()
        # Create the line plots for the facet
        ax = axes[i]
        for grade_col in grade_cols:
            ax.plot(average_grades.index, average_grades[grade_col], marker='o', label=grade_col)
        # Customize the plot for the facet
        ax.set_xlabel(facet)
        ax.set_ylabel('Average Grade')
        ax.set_title(f'Average Grades by {facet}')
        ax.legend()
        # Set x-ticks to the actual scale values
        ax.set_xticks(average_grades.index)
    plt.tight_layout()
    plt.show()
It is important to note that the numeric values for study time and travel time represent different time intervals.
Study time: The scale ranges from 1 to 4, indicating "<2 hours," "2 to 5 hours," "5 to 10 hours," and ">10 hours" respectively.
Travel time: Similarly, the scale ranges from 1 to 4, representing "<15 minutes," "15 to 30 minutes," "30 minutes to 1 hour," and ">1 hour" respectively.
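For readability, these ordinal codes can be mapped to labels before plotting; the dictionaries below simply restate the scale definitions above, applied to a made-up column of codes.

```python
import pandas as pd

# Label dictionaries restating the documented scales.
studytime_labels = {1: "<2 hours", 2: "2 to 5 hours", 3: "5 to 10 hours", 4: ">10 hours"}
traveltime_labels = {1: "<15 min", 2: "15 to 30 min", 3: "30 min to 1 hour", 4: ">1 hour"}

# Example column of codes (made up); .map swaps codes for labels.
studytime = pd.Series([1, 2, 2, 3, 4])
print(studytime.map(studytime_labels).tolist())
# ['<2 hours', '2 to 5 hours', '2 to 5 hours', '5 to 10 hours', '>10 hours']
```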
Students spending 2 to 5 hours per week studying tend to achieve higher average grades, suggesting an optimal range for study time; excessive or insufficient study hours may lead to lower academic performance. Additionally, students with shorter commutes, particularly those under 15 minutes, demonstrate better academic performance than those with longer travel times, possibly due to reduced fatigue and more time for academic or extracurricular activities. Conversely, travel times exceeding an hour are associated with lower average grades, highlighting the potential impact of extended commutes on academic outcomes.
# Set up the figure and axes
plt.figure(figsize=(15, 5))
# Plot for failures.Math with hue as school
plt.subplot(1, 2, 1)
sns.countplot(data=df, x='failures.Math', hue='school', palette='viridis')
plt.title('Frequency of Failures in Math by School')
plt.xlabel('Failures (Math)')
plt.xticks([0, 1, 2, 3], ['0', '1', '2', '3+'])  # Label ticks with the actual failure counts
plt.ylabel('Frequency')
# Plot for failures.Port with hue as school
plt.subplot(1, 2, 2)
sns.countplot(data=df, x='failures.Port', hue='school', palette='viridis')
plt.title('Frequency of Failures in Portuguese by School')
plt.xlabel('Failures (Portuguese)')
plt.xticks([0, 1, 2, 3], ['0', '1', '2', '3+'])  # Label ticks with the actual failure counts
plt.ylabel('Frequency')
# Adjust layout
plt.tight_layout()
# Show plot
plt.show()
The visualization indicates that Gabriel Pereira (GP) students record more Mathematics failures, in raw counts, than Mousinho da Silveira (MS) students, and the disparity in Portuguese failures is less pronounced, though GP counts remain slightly higher in most categories. Because GP enrolls far more students (803 versus 114), these raw frequencies are expected to skew toward GP; per-student failure rates would give a fairer between-school comparison.
# Define the facets and their corresponding columns
facets = {
    'famrel': 'famrel',
    'health': 'health',
    'freetime': 'freetime',
    'goout': 'goout',
    'Dalc': 'Dalc',
    'Walc': 'Walc'
}
# Split the facets into groups of two
facet_groups = [list(facets.keys())[i:i+2] for i in range(0, len(facets), 2)]
grade_cols = ['G1.Port', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math']
# Iterate over each facet group
for facet_group in facet_groups:
    # Create subplots for the facet group
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(14, 6))
    # Iterate over each facet in the group
    for i, facet in enumerate(facet_group):
        column = facets[facet]
        # Group the data by the facet column and calculate the average grades for each subject
        average_grades = df.groupby(column)[grade_cols].mean()
        # Create the line plot for the facet
        ax = axes[i]
        for grade_col in grade_cols:
            ax.plot(average_grades.index, average_grades[grade_col], marker='o', label=grade_col)
        # Customize the plot for the facet
        ax.set_xlabel(facet)
        ax.set_ylabel('Average Grade')
        ax.set_title(f'Average Grades by {facet}')
        ax.legend()
        # Set x-ticks to the actual scale values (1-5)
        ax.set_xticks(average_grades.index)
    plt.tight_layout()
    plt.show()
The analysis of social factors reveals notable patterns in students' academic performance across subjects:
Family Relationships (famrel): from 1 - very bad to 5 - excellent
Students rating their family relations as very bad or good tend to score higher in Portuguese, while those rating relations as bad perform better in Math. Surprisingly, students with very bad family relations perform better in Portuguese than Math. However, those rating relations as excellent achieve lower scores in both subjects compared to those rating them as good.
Health Status (health): from 1 - very bad to 5 - very good
Students reporting very bad health tend to achieve high grades, while those reporting very good health often achieve lower grades. Moderate health ratings correspond to average grades in Portuguese and lower grades in Math.
Free Time After School (freetime): from 1 - very low to 5 - very high
Students with low free time tend to excel in Portuguese compared to those with moderate-to-high free time. Similarly, those with low and very high free time perform well in Math.
Go Out With Their Friends (goout): from 1 - very low to 5 - very high
Students with low goout ratings achieve the best grades in both subjects, while those with very low or very high ratings tend to perform poorly.
Weekday and Weekend Alcohol Consumption (Dalc and Walc): from 1 - very low to 5 - very high
Students consuming very low alcohol levels during weekdays and weekends tend to achieve the best grades, while those with high consumption levels score lower.
In summary, positive family dynamics, moderate free time, limited socializing, and minimal alcohol consumption during weekdays and weekends correlate with better academic outcomes. Conversely, excessive socializing, poor health, and high alcohol consumption are associated with lower academic performance. These findings underscore the importance of a balanced lifestyle and supportive environments in fostering academic success.
# Melt the DataFrame to have a tidy format suitable for plotting
df_melted6 = df.melt(id_vars=['activities', 'higher', 'internet', 'romantic'],
value_vars=['G1.Port', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math'],
var_name='Subject', value_name='Grade')
# Plotting
plt.figure(figsize=(20, 12)) # Adjusted figsize
# Plotting the first row
plt.subplot(2, 2, 1)
sns.barplot(data=df_melted6, x='Subject', y='Grade', hue='activities', palette='viridis', errorbar=None)  # errorbar=None replaces the deprecated ci=None (seaborn >= 0.12)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Participation in Extracurricular Activities (activities)')
plt.xticks(rotation=45)
plt.subplot(2, 2, 2)
sns.barplot(data=df_melted6, x='Subject', y='Grade', hue='higher', palette='viridis', errorbar=None)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Desire to Pursue Higher Education (higher)')
plt.xticks(rotation=45)
# Plotting the second row
plt.subplot(2, 2, 3)
sns.barplot(data=df_melted6, x='Subject', y='Grade', hue='internet', palette='viridis', errorbar=None)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Internet Access at Home (internet)')
plt.xticks(rotation=45)
plt.subplot(2, 2, 4)
sns.barplot(data=df_melted6, x='Subject', y='Grade', hue='romantic', palette='viridis', errorbar=None)
plt.xlabel('Subject')
plt.ylabel('Average Grade')
plt.title('Average Grades by Involvement in a Romantic Relationship (romantic)')
plt.xticks(rotation=45)
# Adjust layout
plt.tight_layout()
plt.show()
The barplots reveal insights into students' academic performance based on different elements:
Participation in Extracurricular Activities (activities), Aspiration for Higher Education (higher), and Internet Access at Home (internet):
Students actively engaged in extracurricular activities, aspiring for higher education, and having internet access at home tend to perform better academically in both Mathematics and Portuguese subjects. These factors likely provide additional learning opportunities and resources, contributing to improved academic outcomes.
Involvement in Romantic Relationships (romantic):
Conversely, students involved in romantic relationships show slightly lower grades in both subjects compared to those not in relationships. However, variations exist across different grading periods, suggesting potential challenges in balancing academic and personal commitments. While romantic relationships enrich students' lives, they may impact academic focus and productivity.
These findings highlight the complex interplay between students' extracurricular engagements, aspirations, romantic involvements, and academic achievements, emphasizing the diverse influences shaping their educational paths.
This section focuses on visualizing correlation matrices to unveil relationships among numerical variables in the dataset. By preprocessing the data and computing the correlation matrix using Pearson's correlation coefficient, we gain insights into the interdependencies among attributes.
The correlation matrix is vital for exploratory data analysis, guiding feature selection, model development, and understanding data structure. Using 'LabelEncoder', categorical variables are transformed into numerical representations for correlation visualization in tabular and heatmap formats.
The goal is to provide a comprehensive overview of correlations' strength and direction between numerical attributes. Correlation values from -1 to 1 indicate negative and positive correlations in the heatmap. This visualization aids in identifying patterns and associations, informing further analysis and decision-making.
def object_to_int(dataframe_series):
if dataframe_series.dtype=='object':
dataframe_series = LabelEncoder().fit_transform(dataframe_series)
return dataframe_series
df2 = df.apply(lambda x: object_to_int(x))
# Select only numerical columns
numeric_df_cm = df2.select_dtypes(include=['int64', 'float64']).drop(columns=['Unnamed: 0', 'Mjob', 'Fjob', 'reason', 'guardian', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math'])
# Compute the correlation matrix
corr_matrix = numeric_df_cm.corr(numeric_only=True)
corr_matrix
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | traveltime | studytime | ... | Dalc | Walc | health | failures.Math | paid.Math | absences.Math | failures.Port | paid.Port | absences.Port | G1.Port | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| school | 1.000000 | -0.048571 | 0.397483 | -0.285071 | -0.013624 | 0.117246 | -0.193015 | -0.143305 | 0.324300 | -0.081089 | ... | 0.087894 | 0.038537 | -0.090097 | 0.065256 | -0.051933 | -0.100414 | 0.113978 | 0.048532 | -0.061991 | -0.173704 |
| sex | -0.048571 | 1.000000 | -0.049422 | -0.038677 | 0.152422 | -0.025504 | 0.101736 | 0.073469 | 0.068859 | -0.289687 | ... | 0.282248 | 0.221027 | 0.117452 | 0.086838 | -0.130729 | -0.063151 | 0.141508 | 0.153260 | -0.000195 | -0.199753 |
| age | 0.397483 | -0.049422 | 1.000000 | -0.151085 | 0.016224 | 0.091727 | -0.169454 | -0.153111 | 0.135217 | 0.042070 | ... | 0.066876 | 0.123352 | -0.052364 | 0.135294 | -0.033496 | 0.129157 | 0.214816 | -0.054872 | 0.104523 | -0.079498 |
| address | -0.285071 | -0.038677 | -0.151085 | 1.000000 | 0.058894 | -0.072920 | 0.182432 | 0.102969 | -0.394712 | -0.011738 | ... | -0.119227 | -0.126112 | 0.018856 | -0.098714 | 0.048724 | -0.086783 | -0.114003 | -0.051835 | -0.080585 | 0.168215 |
| famsize | -0.013624 | 0.152422 | 0.016224 | 0.058894 | 1.000000 | -0.163602 | 0.027248 | -0.011159 | 0.066961 | -0.077653 | ... | 0.127391 | 0.118847 | -0.035246 | -0.042045 | 0.029835 | 0.033102 | -0.040984 | -0.036568 | -0.023353 | 0.085486 |
| Pstatus | 0.117246 | -0.025504 | 0.091727 | -0.072920 | -0.163602 | 1.000000 | -0.136037 | -0.096290 | 0.043933 | 0.042824 | ... | 0.018049 | 0.101941 | 0.059233 | 0.009900 | 0.021740 | -0.174220 | 0.042273 | -0.057635 | -0.037216 | -0.086695 |
| Medu | -0.193015 | 0.101736 | -0.169454 | 0.182432 | 0.027248 | -0.136037 | 1.000000 | 0.651772 | -0.288301 | 0.038068 | ... | 0.015348 | -0.068524 | -0.031590 | -0.249072 | 0.147600 | 0.101776 | -0.204447 | 0.159739 | -0.003134 | 0.228178 |
| Fedu | -0.143305 | 0.073469 | -0.153111 | 0.102969 | -0.011159 | -0.096290 | 0.651772 | 1.000000 | -0.241192 | 0.025874 | ... | 0.002683 | -0.022464 | 0.024243 | -0.251542 | 0.120694 | 0.017163 | -0.178113 | 0.158431 | -0.001585 | 0.171631 |
| traveltime | 0.324300 | 0.068859 | 0.135217 | -0.394712 | 0.066961 | 0.043933 | -0.288301 | -0.241192 | 1.000000 | -0.110507 | ... | 0.202810 | 0.224172 | -0.004783 | 0.175791 | -0.056417 | -0.016805 | 0.100886 | -0.066487 | 0.034550 | -0.214055 |
| studytime | -0.081089 | -0.289687 | 0.042070 | -0.011738 | -0.077653 | 0.042824 | 0.038068 | 0.025874 | -0.110507 | 1.000000 | ... | -0.204715 | -0.218321 | -0.040365 | -0.182688 | 0.107538 | -0.066153 | -0.182901 | -0.034259 | -0.135360 | 0.260848 |
| schoolsup | -0.140542 | -0.049371 | -0.282361 | 0.029277 | -0.017170 | -0.048052 | -0.032069 | 0.050530 | -0.035811 | -0.000500 | ... | 0.054742 | -0.043258 | -0.026942 | -0.011405 | -0.019424 | 0.066349 | -0.002327 | 0.089417 | -0.032529 | -0.153273 |
| famsup | -0.174166 | -0.112996 | -0.137936 | 0.022863 | -0.053342 | 0.011612 | 0.141320 | 0.199248 | -0.029109 | 0.106014 | ... | -0.033679 | -0.029483 | 0.049690 | -0.025744 | 0.300750 | -0.006599 | -0.036986 | 0.127522 | 0.026518 | 0.048757 |
| activities | -0.174788 | 0.155642 | -0.096839 | -0.048981 | -0.018576 | 0.007399 | 0.077432 | 0.149744 | -0.042747 | 0.086255 | ... | -0.007167 | 0.003599 | 0.036680 | -0.099129 | -0.010560 | 0.019914 | -0.029042 | 0.087128 | -0.029461 | 0.095399 |
| nursery | -0.033557 | 0.025909 | -0.029211 | 0.088928 | 0.114200 | -0.078199 | 0.157572 | 0.141846 | -0.069293 | 0.042743 | ... | -0.075097 | -0.135625 | -0.069584 | 0.001735 | 0.042923 | 0.018629 | -0.007046 | 0.039606 | -0.004789 | 0.039044 |
| higher | 0.008092 | -0.125569 | -0.175885 | 0.057979 | 0.009305 | 0.004983 | 0.113334 | 0.122089 | -0.086388 | 0.147030 | ... | -0.034600 | -0.070846 | -0.103214 | -0.264845 | 0.142836 | -0.098816 | -0.171975 | 0.056295 | -0.131696 | 0.247081 |
| internet | -0.074907 | 0.053613 | -0.042415 | 0.231577 | 0.065463 | -0.037626 | 0.203484 | 0.167879 | -0.068350 | 0.078361 | ... | 0.052075 | 0.019384 | -0.046718 | -0.001766 | 0.122218 | 0.086877 | -0.140913 | 0.015029 | 0.089278 | 0.044543 |
| romantic | 0.030278 | -0.136165 | 0.116556 | 0.071713 | -0.005751 | -0.058814 | 0.063459 | 0.066846 | -0.012962 | 0.049948 | ... | 0.009161 | 0.004239 | 0.012765 | 0.000858 | 0.064030 | 0.124288 | -0.056391 | -0.012334 | 0.045154 | 0.019430 |
| famrel | -0.044110 | 0.034375 | 0.022206 | 0.076477 | -0.063672 | -0.046073 | -0.002100 | 0.002817 | -0.044951 | 0.018880 | ... | -0.117763 | -0.152291 | 0.072283 | -0.075803 | -0.026941 | -0.016398 | -0.013096 | 0.104040 | -0.043469 | -0.027898 |
| freetime | -0.022997 | 0.245625 | -0.051782 | 0.069986 | 0.004133 | 0.025431 | 0.051365 | 0.040725 | -0.064677 | -0.124605 | ... | 0.239070 | 0.173571 | 0.062020 | 0.077506 | -0.080370 | -0.033834 | 0.079269 | 0.013077 | 0.050288 | -0.092589 |
| goout | -0.045199 | 0.081110 | 0.125476 | 0.078702 | 0.000537 | 0.019968 | 0.043535 | 0.048071 | 0.034950 | -0.019840 | ... | 0.269763 | 0.436816 | -0.022615 | 0.163351 | -0.001873 | 0.070658 | 0.065673 | 0.033436 | 0.167111 | -0.094717 |
| Dalc | 0.087894 | 0.282248 | 0.066876 | -0.119227 | 0.127391 | 0.018049 | 0.015348 | 0.002683 | 0.202810 | -0.204715 | ... | 1.000000 | 0.653824 | 0.076176 | 0.109079 | 0.079078 | 0.156831 | 0.139002 | 0.102704 | 0.163236 | -0.231271 |
| Walc | 0.038537 | 0.221027 | 0.123352 | -0.126112 | 0.118847 | 0.101941 | -0.068524 | -0.022464 | 0.224172 | -0.218321 | ... | 0.653824 | 1.000000 | 0.072388 | 0.180691 | 0.094534 | 0.174471 | 0.162127 | 0.002423 | 0.198192 | -0.208650 |
| health | -0.090097 | 0.117452 | -0.052364 | 0.018856 | -0.035246 | 0.059233 | -0.031590 | 0.024243 | -0.004783 | -0.040365 | ... | 0.076176 | 0.072388 | 1.000000 | 0.097242 | -0.142373 | 0.022265 | 0.095318 | 0.088675 | 0.062234 | -0.151939 |
| failures.Math | 0.065256 | 0.086838 | 0.135294 | -0.098714 | -0.042045 | 0.009900 | -0.249072 | -0.251542 | 0.175791 | -0.182688 | ... | 0.109079 | 0.180691 | 0.097242 | 1.000000 | -0.195892 | 0.009482 | 0.463349 | 0.015751 | 0.139583 | -0.298922 |
| paid.Math | -0.051933 | -0.130729 | -0.033496 | 0.048724 | 0.029835 | 0.021740 | 0.147600 | 0.120694 | -0.056417 | 0.107538 | ... | 0.079078 | 0.094534 | -0.142373 | -0.195892 | 1.000000 | 0.013519 | -0.144090 | 0.054249 | -0.098989 | 0.095768 |
| absences.Math | -0.100414 | -0.063151 | 0.129157 | -0.086783 | 0.033102 | -0.174220 | 0.101776 | 0.017163 | -0.016805 | -0.066153 | ... | 0.156831 | 0.174471 | 0.022265 | 0.009482 | 0.013519 | 1.000000 | 0.032999 | 0.034835 | 0.502579 | -0.067132 |
| failures.Port | 0.113978 | 0.141508 | 0.214816 | -0.114003 | -0.040984 | 0.042273 | -0.204447 | -0.178113 | 0.100886 | -0.182901 | ... | 0.139002 | 0.162127 | 0.095318 | 0.463349 | -0.144090 | 0.032999 | 1.000000 | 0.123901 | 0.040035 | -0.276026 |
| paid.Port | 0.048532 | 0.153260 | -0.054872 | -0.051835 | -0.036568 | -0.057635 | 0.159739 | 0.158431 | -0.066487 | -0.034259 | ... | 0.102704 | 0.002423 | 0.088675 | 0.015751 | 0.054249 | 0.034835 | 0.123901 | 1.000000 | -0.079123 | -0.149325 |
| absences.Port | -0.061991 | -0.000195 | 0.104523 | -0.080585 | -0.023353 | -0.037216 | -0.003134 | -0.001585 | 0.034550 | -0.135360 | ... | 0.163236 | 0.198192 | 0.062234 | 0.139583 | -0.098989 | 0.502579 | 0.040035 | -0.079123 | 1.000000 | -0.187069 |
| G1.Port | -0.173704 | -0.199753 | -0.079498 | 0.168215 | 0.085486 | -0.086695 | 0.228178 | 0.171631 | -0.214055 | 0.260848 | ... | -0.231271 | -0.208650 | -0.151939 | -0.298922 | 0.095768 | -0.067132 | -0.276026 | -0.149325 | -0.187069 | 1.000000 |
30 rows × 30 columns
# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
# Set up the matplotlib figure
plt.figure(figsize=(12, 10))
# Draw the lower-triangle heatmap; annot=True already prints each correlation value,
# so no manual text loop over the cells is needed
sns.heatmap(corr_matrix, cmap='plasma', annot=True, fmt=".2f", annot_kws={"size": 8}, mask=mask)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()
The correlation matrix reveals relationships between socio-demographic, academic, and behavioral factors in the dataset:
Positive Correlations:
Higher levels of parental education (Medu and Fedu) positively correlate with each other.
Alcohol consumption on weekends (Walc) positively correlates with alcohol consumption on workdays (Dalc), indicating similar drinking patterns across different days.
Negative Correlations:
Aspirations for higher education (higher) negatively correlate with academic failures in Portuguese (failures.Port), suggesting that students aiming for higher education are less likely to experience failures in Portuguese.
Academic failures in Mathematics (failures.Math) negatively correlate with parental education levels (Medu and Fedu), indicating that students with more educated parents are less likely to fail in mathematics.
Weak Correlations:
Most of the remaining variable pairs show only weak correlations (|r| below roughly 0.2), indicating limited linear association among the other attributes.
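Rather than reading the strongest pairs off the heatmap, they can also be ranked programmatically. A minimal sketch on a small synthetic frame (the columns a, b, c are illustrative stand-ins, not dataset fields):

```python
import numpy as np
import pandas as pd

def top_corr_pairs(corr: pd.DataFrame, n: int = 3) -> pd.Series:
    """Return the n strongest off-diagonal correlations, ranked by absolute value."""
    # Keep only the strict lower triangle so each pair appears exactly once
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(mask).stack()  # MultiIndex Series of (row, col) -> r
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index).head(n)

# Tiny illustrative frame: b is a noisy copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({'a': a,
                        'b': a + rng.normal(scale=0.1, size=200),
                        'c': rng.normal(size=200)})
print(top_corr_pairs(df_demo.corr()))  # (b, a) should rank first
```

Applied to the project's `corr_matrix`, the same helper would surface pairs such as Medu/Fedu and Dalc/Walc at the top.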
Additionally, I calculate and visualize the correlation coefficient of each variable with G1.Port (the first-period grade in Portuguese), since this is the target variable for the models built in this project. This provides insight into which factors may have a stronger influence on students' first-period performance in Portuguese. Through these analyses, I aim to identify significant associations that help explain academic performance in Portuguese at the outset of the school year.
# Compute the correlation with 'G1.Port' for each numerical column
correlation_with_G1_Port = numeric_df_cm.corr()['G1.Port'].sort_values(ascending=False)
correlation_with_G1_Port
G1.Port           1.000000
studytime         0.260848
higher            0.247081
Medu              0.228178
Fedu              0.171631
address           0.168215
paid.Math         0.095768
activities        0.095399
famsize           0.085486
famsup            0.048757
internet          0.044543
nursery           0.039044
romantic          0.019430
famrel           -0.027898
absences.Math    -0.067132
age              -0.079498
Pstatus          -0.086695
freetime         -0.092589
goout            -0.094717
paid.Port        -0.149325
health           -0.151939
schoolsup        -0.153273
school           -0.173704
absences.Port    -0.187069
sex              -0.199753
Walc             -0.208650
traveltime       -0.214055
Dalc             -0.231271
failures.Port    -0.276026
failures.Math    -0.298922
Name: G1.Port, dtype: float64
# Plot the correlations
plt.figure(figsize=(14, 7))
correlation_with_G1_Port.plot(kind='bar', color='skyblue')
plt.xlabel('Variables')
plt.ylabel('Correlation with G1.Port')
plt.title('Correlation of Variables with G1.Port')
plt.show()
The correlation analysis with respect to the first period grades in Portuguese (G1.Port) reveals several notable associations:
Positive Correlations:
Study Time (studytime: 0.26), Desire for Higher Education (higher: 0.25), Mother's Education Level (Medu: 0.23), Father's Education Level (Fedu: 0.17), and Urban Residence (address: 0.17) show moderate to weak positive correlations with G1.Port. This suggests that students who spend more time studying, aspire for higher education, have educated parents, and live in urban areas tend to achieve higher grades in Portuguese during the first period.
Negative Correlations:
Absences in Portuguese (absences.Port: -0.19), Gender (sex: -0.20), Weekend Alcohol Consumption (Walc: -0.21), Travel Time to School (traveltime: -0.21), Workday Alcohol Consumption (Dalc: -0.23), Academic Failures in Portuguese (failures.Port: -0.28), and Academic Failures in Mathematics (failures.Math: -0.30) exhibit weak to moderate negative correlations with G1.Port, indicating that students with these characteristics or academic setbacks tend to achieve lower grades in Portuguese during the first period.
Weak Correlations:
Factors such as Payment for Extra Math Classes (paid.Math), Participation in Extracurricular Activities (activities), Family Size (famsize), Family Support (famsup), Internet Access at Home (internet), Attendance at Nursery School (nursery), and Romantic Involvement (romantic) show weak positive correlations with G1.Port, indicating limited influence on initial Portuguese grades. Similarly, Absences in Math (absences.Math), Health Status (health), and School Support (schoolsup) exhibit weak negative correlations with G1.Port, suggesting similarly limited effects on initial Portuguese grades.
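The verbal strength labels used above can be made explicit with a simple binning rule. The cutoffs 0.2 and 0.5 below are a common convention, not values from the source:

```python
def corr_strength(r: float) -> str:
    """Label a Pearson correlation by magnitude (conventional |r| cutoffs)."""
    a = abs(r)
    if a < 0.2:
        return 'weak'
    if a < 0.5:
        return 'moderate'
    return 'strong'

# A few coefficients from the G1.Port correlation listing above
for name, r in [('studytime', 0.26), ('failures.Math', -0.30), ('romantic', 0.02)]:
    print(f"{name}: {corr_strength(r)}")
```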
In this section, I aim to prepare the dataset for modeling by executing a series of essential data preprocessing steps. While some preliminary preprocessing steps, such as checking missing values and splitting the data, were already undertaken earlier in the pipeline, I opt to revisit and standardize these procedures to ensure consistency and completeness in preparing the data for model development.
The first step involves rigorously assessing the dataset for missing values to identify any potential data gaps or inconsistencies. By quantifying and examining missing values across the dataset, I can determine the extent of missingness and strategize appropriate handling techniques to mitigate its impact on subsequent modeling tasks.
missing_values = df.isnull().sum()
print(missing_values)
Unnamed: 0       0
school           0
sex              0
age              0
address          0
famsize          0
Pstatus          0
Medu             0
Fedu             0
Mjob             0
Fjob             0
reason           0
guardian         0
traveltime       0
studytime        0
schoolsup        0
famsup           0
activities       0
nursery          0
higher           0
internet         0
romantic         0
famrel           0
freetime         0
goout            0
Dalc             0
Walc             0
health           0
failures.Math    0
paid.Math        0
absences.Math    0
G1.Math          0
G2.Math          0
G3.Math          0
failures.Port    0
paid.Port        0
absences.Port    0
G1.Port          0
G2.Port          0
G3.Port          0
dtype: int64
The output displays the count of missing values for each column in the dataset df. A count of 0 across all columns implies a complete dataset without missing values, ensuring data integrity for subsequent analysis. This step ensures data quality by identifying and addressing any missing values before further analysis or modeling.
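No imputation is needed here, but had missing values appeared, one standard mitigation is sklearn's SimpleImputer. A sketch on a tiny illustrative frame (the demo frame and its gaps are made up, not the student data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative frame with gaps (not the student dataset)
demo = pd.DataFrame({'age': [15, 16, np.nan, 18],
                     'absences': [2, 0, 4, np.nan]})

# Median imputation is robust to outliers in count-like columns
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
print(filled)
```

For categorical columns, `strategy='most_frequent'` would be the analogous choice.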
In line with the task requirements, all grade features other than the target variable, along with 'Mjob' and 'Fjob', are excluded from the modeling process.
df_model = df.drop(columns=['Mjob', 'Fjob', 'G2.Port', 'G3.Port', 'G1.Math', 'G2.Math', 'G3.Math', 'Unnamed: 0'], axis=1)
df_model.head()
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | reason | guardian | ... | Dalc | Walc | health | failures.Math | paid.Math | absences.Math | failures.Port | paid.Port | absences.Port | G1.Port | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | T | 4 | 4 | reputation | father | ... | 1 | 1 | 4 | 1 | no | 15 | 1 | no | 2 | 14 |
| 1 | MS | F | 18 | R | GT3 | T | 4 | 4 | other | father | ... | 4 | 2 | 5 | 0 | yes | 10 | 0 | no | 0 | 7 |
| 2 | GP | F | 17 | U | LE3 | T | 2 | 2 | course | father | ... | 1 | 3 | 5 | 0 | no | 12 | 0 | no | 2 | 13 |
| 3 | MS | F | 17 | R | GT3 | T | 1 | 2 | course | father | ... | 1 | 2 | 3 | 0 | no | 0 | 0 | no | 0 | 13 |
| 4 | GP | F | 15 | U | GT3 | T | 4 | 4 | reputation | mother | ... | 1 | 2 | 2 | 0 | yes | 0 | 0 | no | 2 | 14 |
5 rows × 32 columns
Next, I apply label encoding to transform categorical features into numerical equivalents, establishing a standardized format suitable for model training and evaluation. I then check all variables again to make sure that none of them remain in string format.
def object_to_int(dataframe_series):
if dataframe_series.dtype=='object':
dataframe_series = LabelEncoder().fit_transform(dataframe_series)
return dataframe_series
df_model = df_model.apply(lambda x: object_to_int(x))
df_model.head()
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | reason | guardian | ... | Dalc | Walc | health | failures.Math | paid.Math | absences.Math | failures.Port | paid.Port | absences.Port | G1.Port | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 18 | 1 | 0 | 1 | 4 | 4 | 3 | 0 | ... | 1 | 1 | 4 | 1 | 0 | 15 | 1 | 0 | 2 | 14 |
| 1 | 1 | 0 | 18 | 0 | 0 | 1 | 4 | 4 | 2 | 0 | ... | 4 | 2 | 5 | 0 | 1 | 10 | 0 | 0 | 0 | 7 |
| 2 | 0 | 0 | 17 | 1 | 1 | 1 | 2 | 2 | 0 | 0 | ... | 1 | 3 | 5 | 0 | 0 | 12 | 0 | 0 | 2 | 13 |
| 3 | 1 | 0 | 17 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | ... | 1 | 2 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 13 |
| 4 | 0 | 0 | 15 | 1 | 0 | 1 | 4 | 4 | 3 | 1 | ... | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 0 | 2 | 14 |
5 rows × 32 columns
# Confirm that none of the variables remain in string format
df_model.dtypes
school           int64
sex              int64
age              int64
address          int64
famsize          int64
Pstatus          int64
Medu             int64
Fedu             int64
reason           int64
guardian         int64
traveltime       int64
studytime        int64
schoolsup        int64
famsup           int64
activities       int64
nursery          int64
higher           int64
internet         int64
romantic         int64
famrel           int64
freetime         int64
goout            int64
Dalc             int64
Walc             int64
health           int64
failures.Math    int64
paid.Math        int64
absences.Math    int64
failures.Port    int64
paid.Port        int64
absences.Port    int64
G1.Port          int64
dtype: object
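The same check can be automated instead of eyeballing the dtypes listing. A minimal sketch on a stand-in frame (demo is illustrative, not df_model):

```python
import pandas as pd

# Illustrative stand-in for df_model after label encoding
demo = pd.DataFrame({'school': [0, 1], 'G1.Port': [14, 7]})

# Programmatic version of the visual dtype check above
remaining_object_cols = demo.columns[demo.dtypes == 'object'].tolist()
assert remaining_object_cols == [], f"still strings: {remaining_object_cols}"
print("all columns are numeric")
```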
To prioritize relevant features and mitigate the curse of dimensionality, I employ the SelectKBest method to perform feature selection. By evaluating the statistical significance of each feature with the target variable, I curate a subset of the most informative features that contribute significantly to predictive performance. This judicious feature selection enhances model interpretability and generalization capabilities.
target_variable = 'G1.Port'
k=10
X = df_model.drop(columns=[target_variable])
y = df_model[target_variable]
selector = SelectKBest(score_func=f_regression, k=k)
X_selected = selector.fit_transform(X, y)
selected_feature_indices = selector.get_support(indices=True)
selected_feature_names = df_model.columns[selected_feature_indices].tolist()
print("Selected Features using SelectKBest:")
print(selected_feature_names)
Selected Features using SelectKBest:
['sex', 'Medu', 'traveltime', 'studytime', 'higher', 'Dalc', 'Walc', 'failures.Math', 'failures.Port', 'absences.Port']
As you can see, 'G1.Port' is the target (dependent) variable for the tasks of this project. I use the SelectKBest method, a common technique for choosing the top k features based on their importance scores. The scoring function 'f_regression' computes the F-value for regression tasks, and the parameter k specifies the number of features to select; here I select the top 10 features (k=10). This step yields ['sex', 'Medu', 'traveltime', 'studytime', 'higher', 'Dalc', 'Walc', 'failures.Math', 'failures.Port', 'absences.Port'] as the best features for predicting the target variable 'G1.Port'.
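The F-scores behind the selection can also be inspected directly via `selector.scores_`. A self-contained sketch on synthetic data (the x0/x1/x2 columns are made up; the real selector above is fit on df_model):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: y depends on x0 strongly, x1 moderately, x2 not at all
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['x0', 'x1', 'x2'])
y = 3 * X['x0'] + 1.0 * X['x1'] + rng.normal(size=300)

selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)

# Per-feature F-scores, highest first; the informative features should dominate
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
print(list(X.columns[selector.get_support()]))  # the k retained features
```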
# create a new dataframe
best_selected_features = df_model[['sex', 'Medu', 'traveltime', 'studytime', 'higher', 'Dalc', 'Walc', 'failures.Math', 'failures.Port', 'absences.Port', 'G1.Port']].copy()
# Define features and target variable
X = best_selected_features.drop(columns=['G1.Port'])
y = best_selected_features['G1.Port']
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
After obtaining the selected features from the SelectKBest method, I split the data into train and test sets. First, I create a new DataFrame, 'best_selected_features', containing the selected features along with the target variable 'G1.Port'.
Next, I define the features and the target: 'X' holds the selected features with 'G1.Port' excluded, while 'y' holds the target variable 'G1.Port' itself.
Finally, the dataset is split into training and testing sets using 'train_test_split', with an 80-20 split between training and testing data ('test_size=0.2') and a fixed seed for reproducibility ('random_state=42').
Following thorough data exploration and preprocessing, the project transitions into the modeling phase aimed at addressing the defined tasks of predicting student success in Mathematics and Portuguese. This phase entails the implementation of both regression and classification models to gain insights into the determinants of academic achievement.
Task 1: Build a predictive model for the target variable 'G1.Port' without using any of the other grade features. Moreover, your model must contain the variables activities, famrel, failures.Math but not the variables Mjob, Fjob.
# Train and evaluate Linear Regression
lr = LinearRegression()
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='r2')
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
lr_r2 = r2_score(y_test, lr_pred)
lr_mae = mean_absolute_error(y_test, lr_pred)
lr_mse = mean_squared_error(y_test, lr_pred)
print('Cross-validated R-squared scores:', lr_scores)
print('Mean R-squared:', lr_scores.mean())
print('R-squared:', lr_r2)  # lr_r2 is already computed above
print('MAE:', lr_mae)
print('MSE:', lr_mse)
Cross-validated R-squared scores: [0.23913556 0.2075402 0.23538813 0.20961651 0.28567931]
Mean R-squared: 0.23547194176269137
R-squared: 0.17716867081093846
MAE: 1.844256955431021
MSE: 5.633998244800398
As the output shows, these results can be interpreted as follows:
Cross-validated R-squared scores: These scores, computed using 5-fold cross-validation, range from 0.207 to 0.286, indicating how consistently the model generalizes across different folds of the training data.
Mean R-squared: The average R-squared value across the cross-validated scores is approximately 0.235, providing an overall assessment of the model's performance on the training data.
R-squared: This value, approximately 0.177, measures how much of the variance in the target variable (G1.Port) the model can explain on the test data.
Mean Absolute Error (MAE): The MAE, approximately 1.844, represents the average absolute difference between the predicted and actual values, indicating the model's average prediction error.
Mean Squared Error (MSE): The MSE, approximately 5.634, calculates the average squared difference between predicted and actual G1.Port values, serving as a measure of the model's predictive accuracy
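For reference, both error metrics follow directly from their definitions. A minimal NumPy check (with made-up grade values) that the hand computation matches sklearn:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative grades, not values from the project
y_true = np.array([14.0, 10.0, 12.0, 8.0])
y_pred = np.array([13.0, 11.0, 15.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))  # average absolute error
mse = np.mean((y_true - y_pred) ** 2)   # average squared error

assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
print(mae, mse)  # 1.25 2.75
```

Because MSE squares each residual, it penalizes large prediction errors much more heavily than MAE does.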
# Define the range of hyperparameters to search
param_grid = {
'alpha': [0.001, 0.01, 0.1, 1.0, 10.0], # Regularization strength
}
# Initialize the Ridge Regression model
ridge = Ridge()
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(ridge, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
# Get the best hyperparameters found by Grid Search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Train Ridge Regression with the best hyperparameters
best_ridge = Ridge(**best_params)
best_ridge.fit(X_train, y_train)
# Evaluate the model
ridge_pred = best_ridge.predict(X_test)
ridge_r2 = r2_score(y_test, ridge_pred)
ridge_mae = mean_absolute_error(y_test, ridge_pred)
ridge_mse = mean_squared_error(y_test, ridge_pred)
print('R-squared:', ridge_r2)
print('MAE:', ridge_mae)
print('MSE:', ridge_mse)
Best Hyperparameters: {'alpha': 1.0}
R-squared: 0.17764748236272965
MAE: 1.8429098258410397
MSE: 5.630719780130075
As the output shows, these results can be interpreted as follows:
Best Hyperparameters: The optimal regularization strength parameter (alpha) is determined to be 1.0, indicating the degree of penalty applied to model coefficients during training to prevent overfitting.
R-squared: With a value of approximately 0.178, the coefficient of determination (R-squared) indicates that the model explains about 17.8% of the variance in the first period grades in Portuguese.
MAE (Mean Absolute Error): The MAE, approximately 1.843, represents the average magnitude of errors between actual and predicted values, suggesting an average deviation of about 1.843 grade points.
MSE (Mean Squared Error): With a value of approximately 5.631, the MSE quantifies the average squared deviation of predictions from actual values, providing insight into the model's predictive accuracy.
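To see concretely what the alpha penalty does, one can compare coefficient norms across alphas on synthetic data; larger alpha shrinks the coefficients toward zero. This is a sketch with made-up data, not the project's fitted model:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

norms = []
for alpha in [0.001, 1.0, 1000.0]:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(np.linalg.norm(coef))
    print(f"alpha={alpha:>7}: ||coef|| = {np.linalg.norm(coef):.3f}")

# Stronger regularization => smaller coefficient norm
assert norms[0] > norms[1] > norms[2]
```

At alpha=1.0, as selected by the grid search above, the penalty is mild, which is why the Ridge metrics are nearly identical to plain Linear Regression.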
# Define the range of hyperparameters to search
param_grid = {
'max_depth': [10, 20, 30, 40], # Maximum depth of the tree
'min_samples_split': [2, 5, 10], # Minimum number of samples required to split an internal node
}
# Initialize the Decision Tree Regression model
dt = DecisionTreeRegressor(random_state=42)
# Perform Grid Search to find the best hyperparameters
grid_search = GridSearchCV(dt, param_grid, cv=5, scoring='r2')
grid_search.fit(X_train, y_train)
# Get the best hyperparameters found by Grid Search
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)
# Train Decision Tree Regression with the best hyperparameters
best_dt = DecisionTreeRegressor(**best_params, random_state=42)
best_dt.fit(X_train, y_train)
# Evaluate the model
dt_pred = best_dt.predict(X_test)
dt_r2 = r2_score(y_test, dt_pred)
dt_mae = mean_absolute_error(y_test, dt_pred)
dt_mse = mean_squared_error(y_test, dt_pred)
print('R-squared:', dt_r2)
print('MAE:', dt_mae)
print('MSE:', dt_mse)
Best Hyperparameters: {'max_depth': 20, 'min_samples_split': 2}
R-squared: 0.828250202686201
MAE: 0.46908643892339547
MSE: 1.175986
From this output, I can explain the following points:
Best Hyperparameters: Grid search identified the optimal hyperparameters for the Decision Tree Regression model: maximum tree depth of 20 and minimum samples required to split an internal node of 2.
R-squared: The model achieves an R-squared value of 0.828, indicating that approximately 82.8% of the variance in the target variable is explained.
MAE (Mean Absolute Error): With an average absolute difference of 0.469 between predicted and actual values, the model's predictions deviate by approximately 0.469 units on average.
MSE (Mean Squared Error): The mean squared error is approximately 1.176, quantifying the average squared difference between predicted and actual values; unlike the MAE, it penalizes large errors more heavily.
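As a quick sanity check on the three metrics used throughout this section, they can be reproduced directly from their definitions and compared against scikit-learn; the arrays below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Made-up grade vectors, only to illustrate the metric definitions
y_true = np.array([12.0, 9.0, 15.0, 11.0])
y_hat = np.array([11.0, 10.0, 14.0, 12.0])

mae = np.mean(np.abs(y_true - y_hat))                                  # average absolute deviation
mse = np.mean((y_true - y_hat) ** 2)                                   # average squared deviation
r2 = 1 - np.sum((y_true - y_hat) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# The hand-rolled values match scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(mse, mean_squared_error(y_true, y_hat))
assert np.isclose(r2, r2_score(y_true, y_hat))
```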
# Train and evaluate Random Forest Regression
rf = RandomForestRegressor(random_state=42)
rf_scores = cross_val_score(rf, X_train, y_train, cv=5, scoring='r2')
rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_r2 = r2_score(y_test, rf_pred)
rf_mae = mean_absolute_error(y_test, rf_pred)
rf_mse = mean_squared_error(y_test, rf_pred)
print('Cross-validated R-squared scores:', rf_scores)
print('Mean R-squared:', rf_scores.mean())
print('R-squared:', rf_r2)
print('MAE:', rf_mae)
print('MSE:', rf_mse)
Cross-validated R-squared scores: [0.70405323 0.61785373 0.75527119 0.77479875 0.62716736]
Mean R-squared: 0.6958288535619827
R-squared: 0.8097503762626015
MAE: 0.728088481492686
MSE: 1.3026558520405547
From this output, I can explain the following points:
Cross-validated R-squared scores: Represented by [0.70, 0.62, 0.76, 0.77, 0.63], these scores are obtained through cross-validation, providing insights into model performance across different subsets of data.
Mean R-squared: This average score of 0.70 summarizes the overall model performance obtained from cross-validation.
R-squared: With a value of 0.81 on the test set, this metric indicates how well the model explains the variance in the target variable.
MAE (Mean Absolute Error): At 0.73, this metric quantifies the average prediction error between the predicted and actual values.
MSE (Mean Squared Error): It is the average of squared errors, providing an overall measure of prediction accuracy. In this case, the MSE is 1.30.
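The cross-validated scores reported above come from `cross_val_score`, which is equivalent to fitting the model on each training fold and scoring the held-out fold. A minimal sketch on synthetic stand-in data (the demo names are illustrative, not the project's variables) makes the mechanics explicit:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, purely illustrative
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=0)
cv = KFold(n_splits=5)

# Manually fit on each training fold and score the held-out fold
manual = []
for train_idx, test_idx in cv.split(X_demo):
    model = RandomForestRegressor(random_state=42).fit(X_demo[train_idx], y_demo[train_idx])
    manual.append(r2_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

# cross_val_score performs the same fit-and-score loop internally
auto = cross_val_score(RandomForestRegressor(random_state=42), X_demo, y_demo, cv=cv, scoring='r2')
```

This is why the five cross-validated R-squared scores vary: each one measures generalization to a different held-out fifth of the training data.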
# Train and evaluate Support Vector Regression
svr = SVR()
svr_scores = cross_val_score(svr, X_train, y_train, cv=5, scoring='r2')
svr.fit(X_train, y_train)
svr_pred = svr.predict(X_test)
svr_r2 = r2_score(y_test, svr_pred)
svr_mae = mean_absolute_error(y_test, svr_pred)
svr_mse = mean_squared_error(y_test, svr_pred)
print('Cross-validated R-squared scores:', svr_scores)
print('Mean R-squared:', svr_scores.mean())
print('R-squared:', svr_r2)
print('MAE:', svr_mae)
print('MSE:', svr_mse)
Cross-validated R-squared scores: [0.22580392 0.27065427 0.23877645 0.22619695 0.29896255]
Mean R-squared: 0.25207882727179537
R-squared: 0.17899634790543817
MAE: 1.8005850908388994
MSE: 5.6214839795103035
From this output, I can explain the following points:
Cross-validated R-squared scores: These values, [0.2258, 0.2707, 0.2388, 0.2262, 0.2990], are obtained from cross-validation, reflecting model performance across different data subsets.
Mean R-squared: The average of the cross-validated R-squared scores is 0.2521, providing an overall assessment of model performance.
R-squared: On the test set, the R-squared score is 0.179, indicating the model's ability to explain 17.9% of the variance in the target variable.
MAE (Mean Absolute Error): At 1.8006, this metric represents the average magnitude of errors between predicted and actual values.
MSE (Mean Squared Error): With a value of 5.6215, the MSE quantifies the average squared difference between predicted and actual values, highlighting the overall prediction accuracy of the model.
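One plausible reason for SVR's weak scores is that the default RBF kernel is sensitive to feature scale, and the features here were not standardized before fitting. The usual remedy is a pipeline with `StandardScaler`; the sketch below uses synthetic data (so the exact numbers are only illustrative) to show the effect when one feature dwarfs the others:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data, purely illustrative
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
X_demo[:, 0] *= 1000                     # one feature on a much larger scale, as raw data often is
y_demo = y_demo / y_demo.std()           # unit-scale target so the default C=1 is reasonable

# Without scaling, the RBF distances are dominated by the large feature
raw_r2 = cross_val_score(SVR(), X_demo, y_demo, cv=5, scoring='r2').mean()
# Standardizing first lets every feature contribute to the kernel
scaled_r2 = cross_val_score(make_pipeline(StandardScaler(), SVR()), X_demo, y_demo, cv=5, scoring='r2').mean()
```

Whether scaling would lift SVR's scores on this particular dataset would need to be verified empirically, but it is the standard first diagnostic.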
# Train and evaluate Gradient Boosting Regression
gb = GradientBoostingRegressor(random_state=42)
gb_scores = cross_val_score(gb, X_train, y_train, cv=5, scoring='r2')
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
gb_r2 = r2_score(y_test, gb_pred)
gb_mae = mean_absolute_error(y_test, gb_pred)
gb_mse = mean_squared_error(y_test, gb_pred)
print('Cross-validated R-squared scores:', gb_scores)
print('Mean R-squared:', gb_scores.mean())
print('R-squared:', gb_r2)
print('MAE:', gb_mae)
print('MSE:', gb_mse)
Cross-validated R-squared scores: [0.4119876  0.32970305 0.44338088 0.42863375 0.3991179 ]
Mean R-squared: 0.4025646389696139
R-squared: 0.37681715412463423
MAE: 1.5879137178250589
MSE: 4.266987577286091
From this output, I can explain the following points:
Cross-validated R-squared scores: These values represent the model's performance across different data subsets obtained from cross-validation: [0.412, 0.330, 0.443, 0.429, 0.399].
Mean R-squared: The average cross-validated R-squared score is approximately 0.403.
R-squared: This score, evaluated on the test set, indicates how much of the variance in the target variable is explained by the model (0.377).
MAE (Mean Absolute Error): The average absolute difference between predicted and actual values in the test set is approximately 1.588.
MSE (Mean Squared Error): It measures the average of the squared differences between predicted and actual values in the test set, giving more weight to larger errors. In this case, it is approximately 4.267.
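Gradient boosting builds its ensemble one tree at a time, so its middling performance relative to the single tree and the forest can be probed with `staged_predict`, which scores the ensemble after each boosting round. A sketch on synthetic data (illustrative only, not the project's variables):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, purely illustrative
X_d, y_d = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.25, random_state=0)

gb_demo = GradientBoostingRegressor(random_state=42).fit(X_tr, y_tr)
# Test-set R-squared after each boosting round (100 rounds by default)
curve = [r2_score(y_te, pred) for pred in gb_demo.staged_predict(X_te)]
```

If the curve is still rising at the final round, more estimators (or a different learning rate) would likely help; if it flattens early, the model has converged.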
Bin the target variable G1.Port into 4 categories in such a way that the resulting bins contain a roughly equal number of cases. Use this newly created categorical variable as the response for a classification model. Again, do not use any other grade feature, and build a model that contains the variables activities, famrel, and failures.Math, but not the variables Mjob and Fjob.
Before building the classification models, I need to bin the target variable 'G1.Port' into 4 categories with approximately equal numbers of cases using the pd.qcut function. The resulting binned variable is named 'G1.Port_binned'. The output below presents the count and percentage of cases in each bin.
# Create a new dataframe
best_selected_features = df_model[['sex', 'Medu', 'traveltime', 'studytime', 'higher', 'Dalc', 'Walc', 'failures.Math', 'failures.Port', 'G1.Port']].copy()
# Bin the target variable into 4 categories with roughly equal numbers of cases
best_selected_features['G1.Port_binned'] = pd.qcut(best_selected_features['G1.Port'], q=4, labels=False)
# Calculate the count of occurrences in each bin
bin_counts = best_selected_features['G1.Port_binned'].value_counts()
# Calculate the percentage of data in each bin
bin_percentages = bin_counts / len(best_selected_features) * 100
# Plot the count of occurrences in each bin with percentage annotations
plt.figure(figsize=(15, 10))
ax = sns.countplot(data=best_selected_features, x='G1.Port_binned', palette='viridis')
plt.title('Distribution of Binned G1.Port')
plt.xlabel('Binned G1.Port')
plt.ylabel('Count')
# Annotate each bar with its percentage value
total_count = len(best_selected_features)
for p in ax.patches:
    percentage = '{:.1f}%'.format(100 * p.get_height() / total_count)
    bar_x = p.get_x() + p.get_width() / 2   # center of the bar
    bar_y = p.get_height()                  # top of the bar
    ax.annotate(percentage, (bar_x, bar_y), ha='center', va='bottom')
plt.show()
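A side note on `pd.qcut`'s behavior: it splits at the sample quartiles, so tied grade values at the bin edges can leave the groups only roughly (not exactly) equal, which is why the percentages in the plot are not all 25%. A tiny illustration with made-up grades:

```python
import pandas as pd

# Made-up grades, purely to illustrate quartile-based binning
grades = pd.Series([8, 9, 10, 10, 11, 12, 12, 13, 14, 15, 16, 18])
binned = pd.qcut(grades, q=4, labels=False)   # integer labels 0..3, cut at the quartiles

counts = binned.value_counts().sort_index()   # bins are roughly, not exactly, equal
```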
# Step 1: Define the features and the response
X = best_selected_features.drop(columns=['G1.Port_binned', 'G1.Port']) # Features
y = best_selected_features['G1.Port_binned'] # Response
# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Choose a classification algorithm and train the model
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.43 0.48 0.45 50
1 0.21 0.26 0.23 43
2 0.33 0.26 0.29 53
3 0.48 0.42 0.45 38
accuracy 0.35 184
macro avg 0.36 0.36 0.36 184
weighted avg 0.36 0.35 0.35 184
As the result shows, the accuracy of the logistic regression model is 0.35, meaning that approximately 35% of the instances in the test set were correctly classified. However, it is essential to consider other metrics to gain a comprehensive understanding of the model's performance.
# Compute the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix (fmt='d' writes each count once, as an integer)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')
plt.title('Confusion Matrix: Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for logistic regression, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 24 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 11 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 14 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 16 represents the number of instances where the model correctly predicted the highest grade category (class 3).
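The per-class correct counts listed above are simply the diagonal of the confusion matrix, so they can be read off programmatically rather than by eye; a minimal sketch on toy labels (the demo arrays are made up, not the project's predictions):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels, purely illustrative
y_true_demo = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_demo = [0, 1, 1, 1, 2, 0, 3, 2]

cm = confusion_matrix(y_true_demo, y_pred_demo)
correct_per_class = np.diag(cm)              # true positives for each class
accuracy = correct_per_class.sum() / cm.sum()  # overall accuracy = diagonal sum / total
```

For the real model, `np.diag(confusion_matrix(y_test, y_pred))` would yield the four counts directly.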
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Logistic Regression')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 3 separates true positives from false positives best, with the highest AUC value (0.75).
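The per-class AUCs computed above via `label_binarize` can be cross-checked against `roc_auc_score`'s built-in one-vs-rest macro averaging, which is a more compact route to the same numbers. A sketch on synthetic four-class data (illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Synthetic four-class problem, purely illustrative
X_d, y_d = make_classification(n_samples=400, n_classes=4, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.25, random_state=0)

proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest macro AUC, computed directly...
auc_ovr = roc_auc_score(y_te, proba, multi_class='ovr', average='macro')
# ...equals the mean of the four per-class binary AUCs
per_class = [roc_auc_score(label_binarize(y_te, classes=[0, 1, 2, 3])[:, i], proba[:, i])
             for i in range(4)]
```

The explicit per-class loop in the notebook is still useful for plotting the individual curves; `roc_auc_score` only returns the areas.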
# Choose a classification algorithm and train the model
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.86 0.88 0.87 50
1 0.57 0.86 0.69 43
2 0.87 0.62 0.73 53
3 0.80 0.63 0.71 38
accuracy 0.75 184
macro avg 0.78 0.75 0.75 184
weighted avg 0.78 0.75 0.75 184
As the result shows, the accuracy of the decision tree model is 0.75, which means that approximately 75% of the instances in the test set were correctly classified by the model.
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')  # fmt='d' writes each count once, as an integer
plt.title('Confusion Matrix: Decision Tree Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for decision tree classification, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 44 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 37 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 33 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 24 represents the number of instances where the model correctly predicted the highest grade category (class 3).
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Decision Tree Classification')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 0 separates true positives from false positives best, with the highest AUC value (0.96).
# Choose a classification algorithm and train the model
model = RandomForestClassifier()
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.85 0.88 0.86 50
1 0.57 0.81 0.67 43
2 0.85 0.62 0.72 53
3 0.75 0.63 0.69 38
accuracy 0.74 184
macro avg 0.75 0.74 0.73 184
weighted avg 0.76 0.74 0.74 184
As the result shows, the accuracy of the random forest classification model is 0.74, which means that approximately 74% of the instances in the test set were correctly classified by the model.
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')  # fmt='d' writes each count once, as an integer
plt.title('Confusion Matrix: Random Forest Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for random forest classification, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 44 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 35 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 33 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 24 represents the number of instances where the model correctly predicted the highest grade category (class 3).
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Random Forest Classification')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 0 separates true positives from false positives best, with the highest AUC value (0.96).
# Choose a classification algorithm and train the model
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.73 0.60 0.66 50
1 0.43 0.58 0.50 43
2 0.56 0.53 0.54 53
3 0.46 0.42 0.44 38
accuracy 0.54 184
macro avg 0.54 0.53 0.53 184
weighted avg 0.56 0.54 0.54 184
As the result shows, the accuracy of the gradient boosting classification model is 0.54, which means that approximately 54% of the instances in the test set were correctly classified by the model.
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')  # fmt='d' writes each count once, as an integer
plt.title('Confusion Matrix: Gradient Boosting Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for gradient boosting classification, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 30 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 25 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 28 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 16 represents the number of instances where the model correctly predicted the highest grade category (class 3).
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Gradient Boosting Classification')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 0 separates true positives from false positives best, with the highest AUC value (0.90).
# Choose a classification algorithm and train the model
model = SVC(probability=True) # Set probability=True
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.64 0.56 0.60 50
1 0.35 0.67 0.46 43
2 0.44 0.21 0.28 53
3 0.38 0.32 0.34 38
accuracy 0.43 184
macro avg 0.45 0.44 0.42 184
weighted avg 0.46 0.43 0.42 184
As the result shows, the accuracy of the support vector classification model is 0.43, which means that approximately 43% of the instances in the test set were correctly classified by the model.
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')  # fmt='d' writes each count once, as an integer
plt.title('Confusion Matrix: Support Vector Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for support vector classification, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 28 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 29 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 11 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 12 represents the number of instances where the model correctly predicted the highest grade category (class 3).
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: Support Vector Classification')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 0 separates true positives from false positives best, with the highest AUC value (0.78).
# Choose a classification algorithm and train the model
model = KNeighborsClassifier()
model.fit(X_train, y_train)
# Evaluate the model
y_pred_proba = model.predict_proba(X_test) # Use predict_proba instead of predict
y_pred = model.predict(X_test) # Keep y_pred for classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.65 0.72 0.69 50
1 0.41 0.63 0.50 43
2 0.62 0.53 0.57 53
3 0.61 0.29 0.39 38
accuracy 0.55 184
macro avg 0.57 0.54 0.54 184
weighted avg 0.58 0.55 0.55 184
As the result shows, the accuracy of the KNeighbors classification model is 0.55, which means that approximately 55% of the instances in the test set were correctly classified by the model.
# Create a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='YlGnBu')  # fmt='d' writes each count once, as an integer
plt.title('Confusion Matrix: KNeighbors Classification')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
In the confusion matrix for KNeighbors classification, the diagonal entries are the true positives: the instances where the model correctly predicts each grade category.
The value 36 represents the number of instances where the model correctly predicted the lowest grade category (class 0).
The value 27 represents the number of instances where the model correctly predicted the second lowest grade category (class 1).
The value 28 represents the number of instances where the model correctly predicted the third lowest grade category (class 2).
The value 11 represents the number of instances where the model correctly predicted the highest grade category (class 3).
# Compute ROC curve and ROC area for each class
n_classes = len(np.unique(y))
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(label_binarize(y_test, classes=np.unique(y))[:, i], y_pred_proba[:, i])  # use y_pred_proba, not y_pred
    roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(label_binarize(y_test, classes=np.unique(y)).ravel(), y_pred_proba.ravel()) # Use y_pred_proba instead of y_pred
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
# Plot ROC curves for each class
plt.figure(figsize=(10, 8))
colors = cycle(['blue', 'red', 'green', 'orange']) # Define colors for each class
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
# Plot micro-average ROC curve
plt.plot(fpr["micro"], tpr["micro"], color='deeppink', lw=2, linestyle='--', label='Micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]))
# Add labels and legend
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve: KNeighbors Classification')
plt.legend(loc='lower right')
plt.show()
As the output shows, the ROC curve for class 0 separates true positives from false positives best, with the highest AUC value (0.87).
# Create a DataFrame to store model evaluation metrics
regression_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Decision Tree', 'Random Forest', 'Support Vector', 'Gradient Boosting'],
    'R-squared': [lr_r2, ridge_r2, dt_r2, rf_r2, svr_r2, gb_r2],
    'MAE': [lr_mae, ridge_mae, dt_mae, rf_mae, svr_mae, gb_mae],
    'MSE': [lr_mse, ridge_mse, dt_mse, rf_mse, svr_mse, gb_mse]
})
# Display the results table
print(regression_results)
               Model  R-squared       MAE       MSE
0  Linear Regression   0.177169  1.844257  5.633998
1   Ridge Regression   0.177647  1.842910  5.630720
2      Decision Tree   0.828250  0.469086  1.175986
3      Random Forest   0.809750  0.728088  1.302656
4     Support Vector   0.178996  1.800585  5.621484
5  Gradient Boosting   0.376817  1.587914  4.266988
Based on these results, the Decision Tree and Random Forest models demonstrate superior performance to the other models on Task 1.
Decision Tree:
Highest R-squared Value: Achieving an impressive R-squared value of approximately 0.83, the Decision Tree model demonstrates a strong ability to explain the variance in the data.
Lowest Error Metrics: With the lowest mean absolute error (MAE) of about 0.47 and a relatively low mean squared error (MSE) around 1.18, the Decision Tree model minimizes prediction errors effectively.
Random Forest:
Strong R-squared Value: The Random Forest model also performs well with an R-squared value of approximately 0.81, indicating a high degree of variance explained by the model.
Comparable Error Metrics: Although its MAE of approximately 0.73 is slightly higher than the Decision Tree's, the Random Forest model maintains a similarly low MSE of approximately 1.30, ensuring robust predictive performance.
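The ranking above can also be extracted programmatically by sorting the results table on R-squared; a small self-contained sketch (the frame is reconstructed here from the printed values so the snippet runs on its own):

```python
import pandas as pd

# Reconstructed from the printed regression_results table above
regression_results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Decision Tree',
              'Random Forest', 'Support Vector', 'Gradient Boosting'],
    'R-squared': [0.177169, 0.177647, 0.828250, 0.809750, 0.178996, 0.376817],
    'MAE': [1.844257, 1.842910, 0.469086, 0.728088, 1.800585, 1.587914],
    'MSE': [5.633998, 5.630720, 1.175986, 1.302656, 5.621484, 4.266988],
})

# Best model first: highest R-squared at the top
ranked = regression_results.sort_values('R-squared', ascending=False).reset_index(drop=True)
```

In the notebook itself, the same `sort_values` call can simply be applied to the frame already in memory.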
# Store the accuracy, precision (weighted), recall (weighted), and F1-score (weighted) of each model,
# transcribed from the classification reports (models without a fixed random_state may vary slightly between runs)
accuracy_scores = [0.35, 0.76, 0.76, 0.54, 0.43, 0.55]
precision_scores = [0.36, 0.79, 0.78, 0.56, 0.46, 0.58]
recall_scores = [0.35, 0.76, 0.76, 0.54, 0.43, 0.55]
f1_scores = [0.35, 0.76, 0.76, 0.54, 0.42, 0.55]
# Define the names of the classification models
model_names = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'Gradient Boosting', 'Support Vector', 'KNeighbors']
# Create a DataFrame to store the model names and evaluation metrics
# (named classification_summary to avoid shadowing sklearn's classification_report function)
classification_summary = pd.DataFrame({
    'Model': model_names,
    'Accuracy': accuracy_scores,
    'Precision (weighted)': precision_scores,
    'Recall (weighted)': recall_scores,
    'F1-score (weighted)': f1_scores
})
# Print or display the summary report table
print("Summary Report for Classification Models:")
print(classification_summary)
Summary Report for Classification Models:
                 Model  Accuracy  Precision (weighted)  Recall (weighted)  F1-score (weighted)
0  Logistic Regression      0.35                  0.36               0.35                 0.35
1        Decision Tree      0.76                  0.79               0.76                 0.76
2        Random Forest      0.76                  0.78               0.76                 0.76
3    Gradient Boosting      0.54                  0.56               0.54                 0.54
4       Support Vector      0.43                  0.46               0.43                 0.42
5           KNeighbors      0.55                  0.58               0.55                 0.55
Based on the provided summary report for classification models in Task 2, both the Decision Tree and Random Forest models exhibit the highest accuracy and balanced performance across precision, recall, and F1-score metrics.
Decision Tree:
Accuracy: 76%
Precision (weighted): 0.79
Recall (weighted): 0.76
F1-score (weighted): 0.76
Random Forest:
Accuracy: 76%
Precision (weighted): 0.78
Recall (weighted): 0.76
F1-score (weighted): 0.76
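The weighted metrics reported above can be recomputed directly from a model's predictions with scikit-learn's metrics module; a minimal sketch on illustrative four-bin labels (not the project's actual predictions):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative true and predicted quartile-bin labels (four bins, as in Task 2)
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 1, 2]
y_pred = [0, 1, 2, 3, 0, 1, 3, 3, 1, 1]

print("Accuracy:", accuracy_score(y_true, y_pred))
# average='weighted' weights each class's score by its support (count in y_true)
print("Precision (weighted):", precision_score(y_true, y_pred, average="weighted"))
print("Recall (weighted):", recall_score(y_true, y_pred, average="weighted"))
print("F1-score (weighted):", f1_score(y_true, y_pred, average="weighted"))
```

The weighted average is the appropriate choice here because the four grade bins, while constructed to be equally populated overall, need not be equally represented in the test split.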
In this project, I embarked on a comprehensive analysis to predict and understand student academic performance, with a particular focus on predicting the first period grade in Portuguese (G1.Port) and classifying student performance categories. Leveraging a combination of regression and classification models, I aimed to uncover insights that could assist educators in identifying at-risk students and tailoring interventions to support their academic success.
My findings revealed that both the Decision Tree and Random Forest models emerged as strong contenders for predicting G1.Port, combining high R-squared values with low mean absolute error (MAE) and mean squared error (MSE). These models exhibited a remarkable ability to explain the variance in G1.Port, providing educators with valuable insight into students' academic performance trends.
Similarly, in the classification task, the Decision Tree and Random Forest models excelled in categorizing students into performance categories, achieving high accuracy and balanced performance across precision, recall, and F1-score metrics. This suggests that these models can effectively identify performance trends and help educators tailor interventions to support students accordingly.